[Kernel][Comms] feat: add custom all-gather kernels by AlpinDale · Pull Request #1524 · dphnAI/aphrodite-engine

AlpinDale · 2025-09-19T05:53:49Z

We don't use all-gather much elsewhere, but context parallelism (CP) uses it heavily. That adds a fair bit of overhead when running CP, sometimes halving the speed. Currently WIP.

CP will land in #1521

AlpinDale · 2025-09-19T13:44:35Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces custom all-gather kernels to optimize performance, particularly for context parallelism. The implementation is a good start, but I've identified several critical issues that need to be addressed. These include function signature mismatches that will cause build failures, a critical bug in handling variable-sized inputs due to incorrect use of std::set, and potential integer overflows from std::accumulate. Please review the detailed comments for fixes.

gemini-code-assist · 2025-09-19T13:46:53Z

+    }
+
+private:
+    std::set<int> mGroup;


Using std::set for mGroup introduces a critical bug and is inefficient. std::set sorts its elements, which will break the correspondence between the ranks and the sizes vector if the input group_ranks is not sorted. This leads to incorrect data gathering when sizes are variable. Additionally, iterating over the set using std::advance inside run_list is inefficient.

Please change mGroup to be a std::vector<int> to preserve the order of ranks and allow for efficient indexing. This requires changes in multiple places:

Change mGroup's type to std::vector<int> here.

Update AllgatherOp constructor (lines 72-74) to accept std::vector<int>.

Update init_custom_ag (lines 148-154) to construct a std::vector<int> from group_ranks and pass it to AllgatherOp.

Update the loop in run_list (lines 118-121) to use mGroup[root_idx] instead of std::advance.

std::vector<int> mGroup;
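
To illustrate the ordering problem described above, here is a minimal sketch. It is not the PR's code: `size_paired_with`, `example_ranks`, and `example_sizes` are hypothetical names, and it assumes `sizes[i]` is meant to belong to the i-th rank as the caller passed it.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>
#include <vector>

// Hypothetical helper (not from the PR): returns the size that gets paired
// with `rank` when sizes[i] is matched to the i-th element of `group` in
// iteration order. Returns -1 if the rank is absent or sizes is too short.
template <typename Group>
int64_t size_paired_with(const Group& group, const std::vector<int64_t>& sizes, int rank) {
    std::size_t i = 0;
    for (int r : group) {
        if (i >= sizes.size()) break;
        if (r == rank) return sizes[i];
        ++i;
    }
    return -1;
}

// Caller-supplied order: rank 2 first, then rank 0, then rank 1.
inline std::vector<int> example_ranks() { return {2, 0, 1}; }

// sizes[i] belongs to example_ranks()[i]:
// rank 2 -> 300, rank 0 -> 100, rank 1 -> 200.
inline std::vector<int64_t> example_sizes() { return {300, 100, 200}; }
```

With a `std::vector<int>` the pairing follows the caller's order, so rank 2 maps to 300. Copying the same ranks into a `std::set<int>` iterates them as 0, 1, 2, so rank 2 would be paired with 200 instead. A set also drops duplicate ranks silently, which is another reason an ordered, indexable container fits better here.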

gemini-code-assist · 2025-09-19T13:46:54Z

 int64_t open_mem_handle(torch::Tensor& mem_handle);
 void free_shared_buffer(int64_t buffer);

+fptr_t init_custom_ag(const std::vector<int64_t>& group_ranks);


The declaration of init_custom_ag here does not match its implementation in custom_all_gather.cu or its declaration in custom_all_gather.cuh. The implementation takes a second argument, nccl_comm_ptr of type int64_t, which is missing here. C++ treats the two signatures as separate overloads, so the one-argument form declared here has no definition, and any reference to it will fail at link time.

Suggested change

-fptr_t init_custom_ag(const std::vector<int64_t>& group_ranks);
+fptr_t init_custom_ag(const std::vector<int64_t>& group_ranks, int64_t nccl_comm_ptr);

gemini-code-assist · 2025-09-19T13:46:54Z

+TORCH_LIBRARY_EXPAND(CONCAT(TORCH_EXTENSION_NAME, _custom_ag), custom_ag) {
+  // Custom all-gather kernels
+  custom_ag.def(
+      "init_custom_ag(int[] group_ranks) -> int");


The TorchScript signature for init_custom_ag is inconsistent with its C++ implementation. It's missing the nccl_comm_ptr argument. This will cause a compilation or runtime error. The signature should accept an additional integer for the communicator pointer.

Suggested change

-      "init_custom_ag(int[] group_ranks) -> int");
+      "init_custom_ag(int[] group_ranks, int nccl_comm_ptr) -> int");

gemini-code-assist · 2025-09-19T13:46:54Z

+            std::all_of(sizes.value().begin(), sizes.value().end(),
+                [&sizes](int64_t size) { return size == sizes.value()[0]; });
+
+        int64_t sum_sizes = sizes.has_value() ? std::accumulate(sizes.value().begin(), sizes.value().end(), 0, std::plus<>{}) : 0;


The initial value for std::accumulate is 0, which is an int. Since the sizes vector contains int64_t values, the sum could overflow an int if it exceeds INT_MAX. The accumulator's type is determined by the type of this initial value. To prevent overflow, please use an int64_t initial value.

int64_t sum_sizes = sizes.has_value() ? std::accumulate(sizes.value().begin(), sizes.value().end(), int64_t{0}, std::plus<>{}) : 0;
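
The type rule behind this comment can be checked at compile time. The sketch below is illustrative only: `sum_sizes_i64` and `SizeIter` are hypothetical names, not part of the PR.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <type_traits>
#include <vector>

using SizeIter = std::vector<int64_t>::const_iterator;

// std::accumulate returns the type of its init argument, regardless of the
// element type of the range. decltype does not evaluate the call.
static_assert(
    std::is_same<decltype(std::accumulate(SizeIter{}, SizeIter{}, 0, std::plus<>{})),
                 int>::value,
    "an int literal as init yields an int accumulator");
static_assert(
    std::is_same<decltype(std::accumulate(SizeIter{}, SizeIter{}, int64_t{0}, std::plus<>{})),
                 int64_t>::value,
    "an int64_t init yields an int64_t accumulator");

// Hypothetical helper: sums sizes with a 64-bit accumulator.
inline int64_t sum_sizes_i64(const std::vector<int64_t>& sizes) {
    return std::accumulate(sizes.begin(), sizes.end(), int64_t{0}, std::plus<>{});
}
```

With the `int64_t{0}` initial value, two sizes of 2,000,000,000 sum to 4,000,000,000, which does not fit in a 32-bit `int`.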

gemini-code-assist · 2025-09-19T13:46:54Z

+                AT_CUDA_CHECK(ncclAllGather(input.data_ptr(), output.mutable_data_ptr(), input.numel(), (*getDtypeMap())[type],
+                    mNcclComm, stream));
+            } else {
+                size_t numel_base = std::accumulate(outputShape.cbegin() + 1, outputShape.cend(), 1, std::multiplies<>{});


The initial value for std::accumulate is 1, which is an int. The product of tensor dimensions can easily overflow an int. The accumulator's type is determined by this initial value. Please use a size_t initial value to prevent potential overflow, as the result is stored in a size_t.

size_t numel_base = std::accumulate(outputShape.cbegin() + 1, outputShape.cend(), size_t{1}, std::multiplies<>{});
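
As a sketch of the same fix applied to the product of trailing dimensions: `numel_per_row` is a hypothetical helper, not the PR's code, and it assumes every dimension is non-negative. The empty-shape guard is an addition of this sketch, since `cbegin() + 1` is not valid on an empty vector.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: number of elements in one slice along dimension 0,
// i.e. the product of shape[1..]. Assumes all dimensions are non-negative.
inline std::size_t numel_per_row(const std::vector<int64_t>& shape) {
    // Guard added for this sketch: cbegin() + 1 is undefined on an empty vector.
    if (shape.empty()) return 1;
    // std::size_t{1} makes the accumulator a size_t rather than an int.
    return std::accumulate(shape.cbegin() + 1, shape.cend(), std::size_t{1},
                           std::multiplies<>{});
}
```

For a shape of {4, 50000, 50000} the trailing product is 2,500,000,000, which already exceeds the range of a 32-bit `int`.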

[Kernel][Comms] feat: add custom all-gather kernels

9200e94

AlpinDale wants to merge 1 commit into main from custom_all_gather
